An Ensemble Spectral Prediction (ESP) model for metabolite annotation

Abstract

Motivation
A key challenge in metabolomics is annotating measured spectra from a biological sample with chemical identities. Currently, only a small fraction of measurements can be assigned identities. Two complementary computational approaches have emerged to address the annotation problem: mapping candidate molecules to spectra, and mapping query spectra to molecular candidates. In essence, the candidate molecule with the spectrum that best explains the query spectrum is recommended as the target molecule. Although candidate ranking is fundamental to both approaches, few prior works have incorporated rank learning tasks in determining the target molecule.

Results
We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), for metabolite annotation. ESP takes advantage of prior neural network-based annotation models that utilize multilayer perceptron (MLP) networks and Graph Neural Networks (GNNs). Based on the ranking results of the MLP- and GNN-based models, ESP learns a weighting for the outputs of MLP and GNN spectral predictors to generate a spectral prediction for a query molecule. Importantly, training data is stratified by molecular formula to provide candidate sets during model training. Further, baseline MLP and GNN models are enhanced by considering peak dependencies through label mixing and multi-tasking on spectral topic distributions. When trained on the NIST 2020 dataset and evaluated on the relevant candidate sets from PubChem, ESP improves average rank by 23.7% and 37.2% over the MLP and GNN baselines, respectively, demonstrating performance gain over state-of-the-art neural network approaches. However, MLP approaches remain strong contenders when considering top five ranks. Importantly, we show that annotation performance is dependent on the training dataset, the number of molecules in the candidate set, and candidate similarity to the target molecule.
Availability and implementation
The ESP code, a trained model, and a Jupyter notebook that guides users on using the ESP tool are available at https://github.com/HassounLab/ESP.
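The abstract notes that training data is stratified by molecular formula to provide candidate sets during training. A minimal sketch of that grouping step follows, assuming each training molecule is available as an (identifier, formula) pair; the function name, identifiers, and formulas are hypothetical illustrations and are not taken from the ESP code.

```python
from collections import defaultdict

def candidate_sets_by_formula(molecules):
    """Group (molecule_id, formula) pairs so that molecules sharing a formula
    form one candidate set. Hypothetical helper, not part of the ESP code."""
    by_formula = defaultdict(list)
    for mol_id, formula in molecules:
        by_formula[formula].append(mol_id)
    return dict(by_formula)

# Illustrative data only: two isomers share a formula, one molecule stands alone.
training = [("mol_a", "C6H12O6"), ("mol_b", "C6H12O6"), ("mol_c", "C9H11NO2")]
sets = candidate_sets_by_formula(training)
print(sets)
```

Under this assumption, a training molecule is ranked against the other molecules in its own formula group, mirroring how candidates are retrieved at test time.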

Figure S1: A conceptual framework for solving the three subproblems involved in the spectrum-to-molecule annotation problem: (1) representation learning of query spectra, (2) molecular attribute prediction from spectral representation, and (3) ranking of candidate molecular attributes against predicted attributes. Shown is a prediction of molecular attributes; however, the ranking is still applicable when predicting candidate de novo molecular structures.

S2 The Presence of difficult-to-rank molecules
We plot the distribution of the molecules at each rank (Fig. S2). All models predict the target molecule correctly for a large number of molecules, hence the high number of molecules at small rank values. However, all models are challenged by difficult-to-rank molecules that result in high rank values. These molecules directly impact the average rank.
Our peak dependency modeling and our learning on rank address this challenge. Both MLP and GNN models improve in this regard when including peak dependencies (Fig. S2A,B). The performance of ESP on the difficult-to-rank molecules also improves when compared to the ESP-SL and ESP-RU models (Fig. S2C,D), thus supporting the improvement in average rank for ESP over these two models.
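The effect of difficult-to-rank molecules on the average rank can be illustrated with a short sketch. It assumes that candidates are scored by cosine similarity between a query spectrum and each candidate's predicted spectrum (the scoring function is an assumption here), and the rank values are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def target_rank(query, candidate_spectra, target_index):
    """1-based rank of the target among candidates, by similarity to the query."""
    scores = [cosine(query, s) for s in candidate_spectra]
    target_score = scores[target_index]
    # Candidates scoring strictly higher than the target push its rank down.
    return 1 + sum(1 for s in scores if s > target_score)

def average_rank(ranks):
    return sum(ranks) / len(ranks)

def rank_at_k(ranks, k):
    """Fraction of test molecules whose target is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Illustrative ranks: most targets rank near the top, one is difficult to rank.
ranks = [1, 1, 2, 1, 3, 1, 250]
print(average_rank(ranks))       # 37.0
print(average_rank(ranks[:-1]))  # 1.5
print(rank_at_k(ranks, 5))
```

In this toy example a single molecule at rank 250 moves the average rank from 1.5 to 37.0, while rank@5 is barely affected, which is why the two metrics can favor different models.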

S3 Ablation studies on the ESP Model
To provide further detailed evaluation on the ESP model components, we evaluate the following variations of the model and report the results in Table S1.
• MLP - Baseline
• GNN - Baseline
• MLP-MixL - MLP + label-mixing layer
• GNN-MixL - GNN + label-mixing layer
• MLP-LDA - MLP + multitasking on spectral motifs
• GNN-LDA - GNN + multitasking on spectral motifs
• MLP-PD - MLP + label-mixing layer + multitasking on spectral motifs
• GNN-PD - GNN + label-mixing layer + multitasking on spectral motifs
• CONCAT-ENS - Ensemble model using the concatenation of MLP and GNN embeddings
• AVG-ENS - Ensemble model taking the average of MLP and GNN spectral predictions, thus weighting the two predicted spectra equally
• ESP-SL - Ensemble classifier is trained using importance weights in proportion to the spectral loss differences (not rank differences).
• ESP-RU - Ensemble classifier is trained on the GNN/MLP labels generated based on ranking results, but each training example is weighted uniformly.
Based on these results, the ESP model is the best performing model in every category except average rank, where the performance is marginally worse than ESP-RU.
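The difference between the ensemble variants can be sketched as follows. This is a simplified illustration under stated assumptions, not the ESP implementation: the ensemble output is taken to be a convex combination of the two predicted spectra with a scalar weight g, and for all three weighting schemes the training label is assumed to be derived from which predictor ranks the target better. All function and parameter names are hypothetical.

```python
def combine(y_mlp, y_gnn, g):
    """Weighted ensemble of two predicted spectra; g is the weight on the MLP output."""
    return [g * a + (1.0 - g) * b for a, b in zip(y_mlp, y_gnn)]

def avg_ens(y_mlp, y_gnn):
    """AVG-ENS: both predicted spectra are weighted equally."""
    return combine(y_mlp, y_gnn, 0.5)

def gate_training_example(rank_mlp, rank_gnn, loss_mlp, loss_gnn, scheme):
    """Return (label, importance_weight) for one training molecule.

    label is 1 if the MLP ranked the target better (lower rank), else 0.
    scheme (assumed encodings): "rank" for ESP-like weighting by rank difference,
    "uniform" for ESP-RU-like weighting, "loss" for ESP-SL-like weighting.
    """
    label = 1 if rank_mlp < rank_gnn else 0
    if scheme == "rank":
        weight = abs(rank_mlp - rank_gnn)
    elif scheme == "uniform":
        weight = 1.0
    elif scheme == "loss":
        weight = abs(loss_mlp - loss_gnn)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return label, float(weight)

print(avg_ens([1.0, 0.0], [0.0, 1.0]))
print(gate_training_example(2, 10, 0.3, 0.1, "rank"))
print(gate_training_example(2, 10, 0.3, 0.1, "uniform"))
```

With rank-difference weighting, molecules on which the two predictors disagree strongly contribute more to training the ensemble classifier than molecules on which they rank the target almost identically.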

S4 Candidate set distributions
The distribution of the number of candidates retrieved from PubChem for each molecule in our test set is long-tailed (Fig. S3A). The average number of candidates was 4,728, with a maximum of 48,292 candidates. The similarity of the candidate sets to the target molecule shows a normal distribution (Fig. S3B). The sets with high molecular similarities indicate that there are candidates that are difficult to rank. Our experiments show that ESP and the other models are more challenged by high-similarity candidate sets. There is a weak correlation between the similarity and the size of the candidate sets (R² is -0.63) (Fig. S3C).
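The candidate-set profiling can be sketched in a few lines. The sketch assumes that fingerprints are represented as sets of on-bit indices and that similarity is measured with the Tanimoto coefficient; the text specifies MACCS fingerprints but not the similarity measure, so Tanimoto is an assumption. The fingerprints below are toy values, not real MACCS keys.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def mean_candidate_similarity(target_fp, candidate_fps):
    """Average similarity between the target and every candidate in its set."""
    return sum(tanimoto(target_fp, fp) for fp in candidate_fps) / len(candidate_fps)

def pearson_r(xs, ys):
    """Pearson correlation, e.g. between candidate-set size and mean similarity."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy fingerprints (illustrative only).
target = {1, 2, 3}
candidates = [{2, 3, 4}, {1, 2, 3}, {7, 8}]
print(mean_candidate_similarity(target, candidates))
```

Computing one mean similarity per candidate set and pairing it with the set size gives the points of a scatter plot like the one described for Fig. S3C.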

S5 ESP, MLP-PD and GNN-PD performance as a function of candidate set size and similarity
Both MLP-PD and GNN-PD models are more challenged as the candidate set size increases from 50 to 100 to 250 to 1,000 candidates (Fig. S4). ESP outperforms GNN-PD and MLP-PD on the least similar candidate sets (Fig. S5A). However, on the most similar candidate sets, MLP-PD outperforms ESP on ranks 1-3, but ESP outperforms MLP-PD on higher ranks (Fig. S5B). See also Table 1(B) in the main text.

S6 MLP-PD and GNN-PD performance on realistic data splits and full-positive mode
The t-SNE plot shows distinct clusters of the test molecules (Fig. S6A). Under the realistic split, the models are trained on the larger clusters and tested on the smaller clusters. ESP marginally outperforms GNN-PD under the realistic split (Fig. S6B). When training on the full positive mode dataset, performance drops for all models (Table S2, Fig. S6C).

S7 Annotation example
We provide an annotation example to highlight the influence of candidate similarity on candidate ranking performance across all models (Fig. S7). While the three baseline models rank the target molecule among the top 3 candidates, ESP provides the correct ranking for the target molecule. MLP-PD and GNN-PD provide improved ranking over the baselines, but rank the target molecule in second position.

S8 Peak dependency modeling
A label-mixing layer allows modeling peak dependencies (Fig. S8). The spectrum with label mixing, ŷ_co, is computed based on L label-mixing layers. Label mixing is learned through a lower-dimensional matrix D.
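One way such a layer could work is sketched below. The exact update rule is not given in this supplement, so the form used here is an assumption: the co-occurrence matrix is approximated as Q = D Dᵀ, and each layer applies a residual update ŷ ← ŷ + (τ ⊙ Q) ŷ, where ⊙ is the elementwise product. The function names and toy values are hypothetical.

```python
def low_rank_cooccurrence(D):
    """Approximate the P x P co-occurrence matrix as Q = D D^T, with D of shape P x r."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in D] for row_i in D]

def label_mixing(y_hat, D, tau, num_layers=1):
    """Apply num_layers residual mixing steps (assumed form): y <- y + (tau * Q) y,
    where tau * Q is an elementwise product of two P x P matrices."""
    Q = low_rank_cooccurrence(D)
    y = list(y_hat)
    size = len(y)
    for _ in range(num_layers):
        mixed = [sum(tau[i][j] * Q[i][j] * y[j] for j in range(size)) for i in range(size)]
        y = [a + m for a, m in zip(y, mixed)]
    return y

# Toy setup: peaks 0 and 1 co-occur, peak 2 is independent (rank r = 1).
D = [[1.0], [1.0], [0.0]]
tau = [[0.0 if i == j else 0.1 for j in range(3)] for i in range(3)]
y_co = label_mixing([1.0, 0.0, 0.5], D, tau)
print(y_co)
```

In the toy example, peak 1 was initially predicted as absent but gains intensity because it co-occurs with the strongly predicted peak 0, while the independent peak 2 is unchanged. Storing D (P × r) rather than a full P × P matrix keeps the number of learned parameters small when r is much smaller than P.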

Figure S2: Number of test molecules at a particular rank. Our model improvements address the difficult-to-rank molecules and hence improve the average rank. A) MLP-PD vs MLP model. B) GNN-PD vs GNN model. C) ESP vs ESP-SL model. D) ESP vs ESP-RU model.

Figure S3: Profiling candidate molecules retrieved from PubChem. A) Histogram of number of candidates (x-axis) for the test molecules. B) Histogram of pairwise MACCS fingerprint similarity between target molecules and their respective candidates. C) Scatter plot of candidate sets showing the size of the candidate set (x-axis) against similarity between target and candidates in each candidate set.

Figure S5: Comparing rank@k performance for ESP, GNN-PD, MLP-PD models for different candidate sets: A) least similar candidate sets, B) most similar candidate sets.

Figure S6: Analysis for realistic data splits. A) Realistic split t-SNE plot, where grey clusters are used for training and red clusters are used for testing. B) Comparing rank@k performance for ESP, GNN-PD, MLP-PD models under the realistic split. C) Comparing rank@k for the full positive ion data set.

Figure S7: Metabolite annotation example for target molecule Propafenone. Shown are the 10 most and least similar candidates with their respective fingerprint similarity scores.

Figure S8: Using label mixing to capture co-occurring spectral peaks. This mixing is applied based on previous prediction ŷ, co-occurrence matrix Q, and weight matrix τ. The co-occurrence matrix, Q, is approximated from the learned lower-dimension matrix D.

Table S1: Ablation study on the ESP model. See text on ablation studies in this supplement for experimental details. The evaluation is on [M+H] precursor mode data for 100-molecule average candidate set size under random data split. Average rank reports on the overall performance on the test set. Rank@k represents the portion of correct identifications when considering the top k candidates.

Table 1(B) in the main text.

Table S2: Metabolite annotation evaluation on full positive ion mode.